Introduction:

Since the first modern Games in Athens in 1896, two years after the International Olympic Committee (IOC) was founded, the Olympic Games have been a leading international sporting competition in which thousands of athletes from around the world participate in a variety of events. The athleticism of these competitors, along with the attributes that make one an Olympian, is of wide interest. The olympics.csv dataset used in this analysis describes several attributes of Olympic athletes, along with whether or not each athlete earned a medal. The dataset spans 120 years of Olympic Games, from Athens 1896 to Rio 2016. It contains 271,116 rows, each recording one athlete's entry in one event (an athlete who competed in several events appears more than once), with attributes such as sex, weight, age, height, and the sporting event the athlete participated in.

One goal of this research is to determine whether there are age differences between male and female Olympic gymnasts who were or were not successful in earning a medal. This analysis uses the sport (Text) column to narrow the athletes down to only those of interest, the gymnasts. The sex (M/F), medal (Text), and age (Number) columns are also used in this primary research task. Another goal of this research is to see how the age distribution of gymnasts has changed over the years. This is done by comparing the following columns: sex (M/F), age (Number), medal (Text), and year (Number).

The first six rows of the olympics dataset are shown below:

head(olympics)
## # A tibble: 6 × 15
##      id name      sex     age height weight team  noc   games  year season city 
##   <dbl> <chr>     <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
## 1     1 A Dijiang M        24    180     80 China CHN   1992…  1992 Summer Barc…
## 2     2 A Lamusi  M        23    170     60 China CHN   2012…  2012 Summer Lond…
## 3     3 Gunnar N… M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
## 4     4 Edgar Li… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
## 5     5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
## 6     5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
## # … with 3 more variables: sport <chr>, event <chr>, medal <chr>

Approach:

To begin the analysis, it is necessary to narrow the dataset from all Olympic competitors (271,116) to only those who competed in gymnastics (25,528) and to create a binary column for medalists. This is shown in the analysis section of the research document. The approach for the first research task is to use a violin plot, which compares distributions across multiple groups. Here the groups are defined by sex and medalist status, and age is the variable whose distribution is compared.

The approach for the second research task is to create a series of boxplots over time, using a facet to make an additional comparison. This visualizes the summary statistics of age (median, quartiles, and outliers) for each year. The columns compared in this analysis are sex, medalist, year, and age.

Analysis:

The block of R code below creates the narrowed dataset containing only those athletes who competed in gymnastics, and adds the binary medalist column.

olympic_gymnasts <- olympics %>% 
  filter(!is.na(age)) %>%             # only keep athletes with known age
  filter(sport == "Gymnastics") %>%   # keep only gymnasts
  mutate(
    medalist = case_when(             # add column for success in medaling
      is.na(medal) ~ FALSE,           # NA values go to FALSE
      !is.na(medal) ~ TRUE            # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )
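
As a quick check of the filtering and medalist logic above, here is a minimal sketch on a few made-up rows. The `toy` data frame and the `medalist_cw` column are hypothetical and only stand in for the real olympics data. The sketch also shows that, assuming `medal` is NA exactly when no medal was won, the `case_when()` call is equivalent to the one-line expression `!is.na(medal)`.

```r
library(dplyr)

# Hypothetical toy rows standing in for the olympics data (not real records)
toy <- data.frame(
  sport = c("Gymnastics", "Gymnastics", "Gymnastics", "Judo", "Gymnastics"),
  age   = c(19, 24, NA, 30, 22),
  medal = c("Gold", NA, "Silver", NA, NA)
)

toy_gymnasts <- toy %>%
  filter(!is.na(age), sport == "Gymnastics") %>%  # same two filters, combined
  mutate(
    medalist_cw = case_when(is.na(medal) ~ FALSE, !is.na(medal) ~ TRUE),
    medalist    = !is.na(medal)                   # one-line equivalent
  )

# Both definitions should agree on every row
stopifnot(identical(toy_gymnasts$medalist_cw, toy_gymnasts$medalist))

# Rows per group; on the real data this shows how unbalanced the groups are
count(toy_gymnasts, medalist)
```

Running `count(olympic_gymnasts, medalist)` on the real data would give the group sizes behind each violin.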

Research Task 1

The block of code below is used to determine whether there are age differences between male and female Olympic gymnasts who were or were not successful in earning a medal.

ggplot(
  olympic_gymnasts, 
  aes(factor(sex),age, fill = medalist) #plotting sex on the x axis and age on the y
  ) + 
  geom_violin( #violin to show a comparison of distributions between multiple groups
    alpha = 0.75 #using a violin plot with transparency applied
    ) +
  labs( #adding labels
    title="Olympic Gymnasts"
    ) +
  scale_x_discrete(
    name = "Sex",
    labels = c("Female", "Male") 
    ) +
  scale_y_continuous(
    name = "Age"
    ) + 
  scale_fill_brewer(
    name = "", #did not want to add a name to the legend as it is self explanatory
    labels = c("Not a Medalist", "Medalist"), #logical fill is ordered FALSE, TRUE
    palette = "Set3" #using two distinct colors for medal and no medal
    ) + 
  theme_bw( #theme for aesthetics
  ) + 
  theme(
    legend.position = "top",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank()
    )

Research Task 2

The R code shown below is used to see how the age distribution has changed over the years.

options(repr.plot.width=35, repr.plot.height=17) #plot size; set before plotting, not inside ggplot()

ggplot(
  olympic_gymnasts,
  aes(x = factor(year), y = age, fill = sex)
  ) +
  geom_boxplot( #boxplot for statistical representation of the age over years
    alpha = 0.75 #adding transparency for visualization of the outliers
    ) +
  facet_wrap( #using a facet to add an additional graph for comparison
    ~medalist, ncol = 1,
    labeller = as_labeller(c(`TRUE` = "Medalist", `FALSE` = "Not a Medalist"))
    ) +
  labs( #adding various labels
    title="Olympic Gymnasts"
    ) +
  scale_x_discrete(
    name = "Year" 
    ) +
  scale_y_continuous(
    name = "Age"
    ) + 
  scale_fill_brewer(
    name = "", #no title for the legend for visualization aesthetics
    labels = c("Female", "Male"), 
    palette = "Accent" #custom palette for male and female
    #if color legend was medal/not medal I would have been consistent with Set3 color palette
    ) + 
  theme_bw( #adding a theme for visualization 
  ) + 
  theme( #aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"), 
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.text.x =  element_text(angle = 90)
    ) 

Discussion:

The results of the analysis for the first research task show that male Olympic gymnasts are typically older than female Olympic gymnasts. The violin plots for those who received medals are rougher than the plots for those who did not, which are smooth. This may be because far more gymnasts did not receive medals, so the medalist distributions are estimated from fewer observations.
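
The visual comparison in the violin plot could be backed up with a small numeric summary. The following is a minimal sketch, assuming a data frame with `sex`, `age`, and `medalist` columns like the one built in the analysis section. The rows in `toy_gymnasts` here are made up for illustration and are not real records, and `age_summary` is a hypothetical name.

```r
library(dplyr)

# Hypothetical toy rows (not real records); swap in olympic_gymnasts for the real check
toy_gymnasts <- data.frame(
  sex      = c("F", "F", "F", "M", "M", "M", "M"),
  age      = c(16, 18, 20, 22, 24, 26, 28),
  medalist = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

age_summary <- toy_gymnasts %>%
  group_by(sex, medalist) %>%
  summarise(
    n          = n(),          # group size: small groups give rougher density estimates
    median_age = median(age),  # typical age in the group
    iqr_age    = IQR(age),     # spread of the middle half of the ages
    .groups    = "drop"
  )

age_summary
```

The `n` column is the useful part for the roughness question: if the medalist groups are much smaller, their density estimates rest on fewer observations.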

The second graph, which shows how the age distribution has changed over the years, indicates that the typical (median) age has generally stayed the same, with a slight increase in recent years. The age distribution has also narrowed over the years. There are no female gymnasts in the data from 1896 to 1924 because women's gymnastics was not an Olympic event before the 1928 Games.
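
The trends read off the boxplots could be checked with a per-year table. This is a minimal sketch, assuming columns named `year`, `sex`, and `age` as in the dataset. The rows in `toy_gymnasts` are made up for illustration, and `age_by_year` and `first_year` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy rows (not real records); swap in olympic_gymnasts for the real check
toy_gymnasts <- data.frame(
  year = c(1924, 1924, 1928, 1928, 1928, 2016, 2016, 2016),
  sex  = c("M", "M", "F", "F", "M", "F", "F", "M"),
  age  = c(25, 31, 18, 22, 27, 19, 21, 24)
)

# Median and IQR of age per year and sex: the numbers behind each box
age_by_year <- toy_gymnasts %>%
  group_by(year, sex) %>%
  summarise(
    n          = n(),
    median_age = median(age),
    iqr_age    = IQR(age),
    .groups    = "drop"
  )

# First year in which each sex appears in the data
first_year <- toy_gymnasts %>%
  group_by(sex) %>%
  summarise(first_year = min(year), .groups = "drop")

age_by_year
first_year
```

On the real data, a shrinking `iqr_age` over time would support the narrowing described above, and `first_year` would show when female gymnasts first appear.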

Overall, the olympics.csv dataset supported both research tasks: the violin plot compares gymnasts' ages by sex and medal status, and the faceted boxplots show how those ages have changed from 1896 to 2016.